QSAR studies of the bioactivity of hepatitis C virus (HCV) NS3/4A protease inhibitors by multiple linear regression (MLR) and support vector machine (SVM)

Qin, Z.J.; Wang, M.L.; Yan, A.X.*
Bioorganic & Medicinal Chemistry Letters, 2017, 27, 2931–2938.

    In this study, quantitative structure-activity relationship (QSAR) models were built with multiple linear regression (MLR) and support vector machine (SVM) methods to predict the bioactivity of hepatitis C virus (HCV) NS3/4A protease inhibitors, exploring various descriptor sets and training/test set selection methods. A dataset of 512 HCV NS3/4A protease inhibitors and their IC50 values, all determined by the same FRET assay, was collected from the literature. Each inhibitor was represented by nine selected global descriptors and 12 2D property-weighted autocorrelation descriptors calculated with the program CORINA Symphony. The dataset was divided into a training set and a test set either randomly or with a Kohonen self-organizing map (SOM). The correlation coefficients (r²) for the training and test sets were 0.75 and 0.72 for the best MLR model and 0.87 and 0.85 for the best SVM model, respectively. In addition, a series of sub-dataset models was developed. All of the best sub-dataset models performed better than the whole-dataset models. We believe that the combination of the best sub-dataset and whole-dataset SVM models can serve as a reliable lead-design tool for new NS3/4A protease inhibitor scaffolds in a drug discovery pipeline.
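The SOM-based training/test division mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the map size, the learning schedule, the 25% test fraction, and the function names `train_som`, `best_matching_unit`, and `som_split` are all assumptions chosen for illustration, and the descriptors are assumed to be scaled to comparable ranges beforehand. The idea is to map each compound to its closest neuron and then draw test compounds from every occupied neuron, so that both sets cover the same regions of descriptor space.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid position of the neuron closest to sample x."""
    return min(weights,
               key=lambda n: sum((w - v) ** 2 for w, v in zip(weights[n], x)))


def train_som(data, rows=3, cols=3, epochs=50, lr0=0.5, seed=0):
    """Train a small rectangular SOM; returns {(row, col): weight vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    sigma0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                 # learning rate decays linearly
        sigma = sigma0 * (1.0 - frac) + 0.5     # neighbourhood radius shrinks
        for x in rng.sample(data, len(data)):   # visit samples in random order
            bmu = best_matching_unit(weights, x)
            for node, w in weights.items():
                d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-d2 / (2.0 * sigma ** 2))
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
    return weights


def som_split(data, weights, test_fraction=0.25, seed=0):
    """Draw test samples neuron by neuron so both sets cover the map."""
    rng = random.Random(seed)
    by_node = {}
    for idx, x in enumerate(data):
        by_node.setdefault(best_matching_unit(weights, x), []).append(idx)
    train, test = [], []
    for node in sorted(by_node):
        members = by_node[node]
        rng.shuffle(members)
        # Neurons holding a single compound contribute to training only.
        n_test = max(1, round(len(members) * test_fraction)) if len(members) > 1 else 0
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return sorted(train), sorted(test)


# Synthetic stand-in for scaled descriptor vectors (not data from the paper).
demo_rng = random.Random(42)
descriptors = [[demo_rng.random() for _ in range(3)] for _ in range(60)]
som = train_som(descriptors)
train_idx, test_idx = som_split(descriptors, som)
print(len(train_idx), len(test_idx))
```

Compared with a purely random split, this kind of selection keeps the test set from falling into regions of descriptor space that the training set never saw.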

QSAR model performance: Dataset 1 (512 inhibitors)

| Model | Algorithm | Descriptors | Splitting method | Training r² | Training sd | Training MAE | Test r² | Test sd | Test MAE |
|---|---|---|---|---|---|---|---|---|---|
| Model A1 | MLR | 7 CORINA Global | Random | 0.67 | 0.95 | 0.73 | 0.58 | 1.05 | 0.76 |
| Model A2 | MLR | 7 CORINA Global | Kohonen SOM | 0.64 | 0.98 | 0.76 | 0.65 | 0.95 | 0.74 |
| Model B1 | MLR | 7 CORINA Global | Random | 0.64 | 0.98 | 0.77 | 0.54 | 1.10 | 0.81 |
| Model B2 | MLR | 7 CORINA Global | Kohonen SOM | 0.60 | 1.03 | 0.81 | 0.63 | 0.98 | 0.75 |
| Model C1 | MLR | 2 CORINA Global, 8 CORINA 2D | Random | 0.77 | 0.80 | 0.62 | 0.67 | 0.93 | 0.65 |
| Model C2 | MLR | 2 CORINA Global, 8 CORINA 2D | Kohonen SOM | 0.75 | 0.82 | 0.62 | 0.72 | 0.87 | 0.66 |
| Model D1 | MLR | 2 CORINA Global, 9 CORINA 2D | Random | 0.75 | 0.83 | 0.65 | 0.67 | 0.94 | 0.68 |
| Model D2 | MLR | 2 CORINA Global, 9 CORINA 2D | Kohonen SOM | 0.73 | 0.86 | 0.66 | 0.72 | 0.87 | 0.65 |
| Model A3 | SVM | 7 CORINA Global | Random | 0.80 | 0.73 | 0.53 | 0.72 | 0.84 | 0.60 |
| Model A4 | SVM | 7 CORINA Global | Kohonen SOM | 0.78 | 0.77 | 0.55 | 0.79 | 0.72 | 0.56 |
| Model B3 | SVM | 7 CORINA Global | Random | 0.81 | 0.72 | 0.52 | 0.73 | 0.83 | 0.58 |
| Model B4 | SVM | 7 CORINA Global | Kohonen SOM | 0.79 | 0.75 | 0.52 | 0.80 | 0.73 | 0.53 |
| Model C3 | SVM | 2 CORINA Global, 8 CORINA 2D | Random | 0.90 | 0.54 | 0.42 | 0.75 | 0.82 | 0.55 |
| Model C4 | SVM | 2 CORINA Global, 8 CORINA 2D | Kohonen SOM | 0.88 | 0.56 | 0.40 | 0.83 | 0.68 | 0.50 |
| Model D3 | SVM | 2 CORINA Global, 9 CORINA 2D | Random | 0.90 | 0.53 | 0.41 | 0.81 | 0.70 | 0.49 |
| Model D4 | SVM | 2 CORINA Global, 9 CORINA 2D | Kohonen SOM | 0.87 | 0.59 | 0.42 | 0.85 | 0.63 | 0.47 |
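The three statistics reported for each model (r², sd, MAE) can be computed from observed and predicted activity values roughly as follows. This is a sketch under stated assumptions, since the page does not give the formulas: r² is taken as the squared Pearson correlation between observed and predicted values, sd as the standard deviation of the prediction errors with an n - 1 denominator, and MAE as the mean absolute error. The function name `regression_metrics` and the example values are hypothetical.

```python
import math


def regression_metrics(observed, predicted):
    """Return (r2, sd, mae) for paired observed/predicted activity values.

    Assumed definitions (not confirmed by the source):
      r2  = squared Pearson correlation coefficient
      sd  = sqrt(sum(error^2) / (n - 1))
      mae = mean(|error|)
    """
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    r2 = cov ** 2 / (var_o * var_p)
    errors = [o - p for o, p in zip(observed, predicted)]
    sd = math.sqrt(sum(e ** 2 for e in errors) / (n - 1))
    mae = sum(abs(e) for e in errors) / n
    return r2, sd, mae


# Made-up activity values, for illustration only.
observed = [5.0, 6.0, 7.0, 8.0]
predicted = [5.5, 5.5, 7.5, 7.5]
r2, sd, mae = regression_metrics(observed, predicted)
print(f"r2={r2:.2f} sd={sd:.2f} MAE={mae:.2f}")
```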

QSAR models: Dataset 2 (355 linear inhibitors from Dataset 1)

| Model | Splitting method | Algorithm | Descriptors | Training r² | Training sd | Training MAE | Test r² | Test sd | Test MAE |
|---|---|---|---|---|---|---|---|---|---|
| Model C2 (for predicting 355 linear inhibitors) | Kohonen SOM | MLR | 2 CORINA Global, 8 CORINA 2D | 0.74 | 0.84 | 0.62 | 0.68 | 0.91 | 0.67 |
| Model LA1 | Kohonen SOM | MLR | 2 CORINA Global, 8 CORINA 2D | 0.77 | 0.78 | 0.59 | 0.77 | 0.77 | 0.59 |
| Model D4 (for predicting 355 linear inhibitors) | Kohonen SOM | SVM | 2 CORINA Global, 8 CORINA 2D | 0.86 | 0.62 | 0.44 | 0.83 | 0.68 | 0.49 |
| Model LB2 | Kohonen SOM | SVM | 2 CORINA Global, 8 CORINA 2D | 0.87 | 0.59 | 0.43 | 0.85 | 0.63 | 0.45 |

QSAR models: Dataset 3 (157 macrocyclic inhibitors from Dataset 1)

| Model | Splitting method | Algorithm | Descriptors | Training r² | Training sd | Training MAE | Test r² | Test sd | Test MAE |
|---|---|---|---|---|---|---|---|---|---|
| Model C2 (for predicting 157 macrocyclic inhibitors) | Kohonen SOM | MLR | 2 CORINA Global, 8 CORINA 2D | 0.29 | 0.81 | 0.60 | 0.32 | 0.86 | 0.62 |
| Model MC1 | Kohonen SOM | MLR | 2 CORINA Global, 8 CORINA 2D | 0.58 | 0.57 | 0.41 | 0.47 | 0.66 | 0.47 |
| Model D4 (for predicting 157 macrocyclic inhibitors) | Kohonen SOM | SVM | 2 CORINA Global, 8 CORINA 2D | 0.60 | 0.56 | 0.39 | 0.55 | 0.62 | 0.41 |
| Model MD2 | Kohonen SOM | SVM | 2 CORINA Global, 8 CORINA 2D | 0.76 | 0.45 | 0.28 | 0.67 | 0.50 | 0.35 |
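The statement that the sub-dataset models outperform the whole-dataset models on the same compounds can be checked directly from the test-set r² values in the Dataset 2 and Dataset 3 tables. The short sketch below only restates those tabulated numbers; the dictionary layout and variable names are illustrative.

```python
# Test-set r2 values copied from the Dataset 2 and Dataset 3 tables:
# (subset, algorithm) -> (whole-dataset model, sub-dataset model)
test_r2 = {
    ("linear", "MLR"): (0.68, 0.77),        # Model C2 vs Model LA1
    ("linear", "SVM"): (0.83, 0.85),        # Model D4 vs Model LB2
    ("macrocyclic", "MLR"): (0.32, 0.47),   # Model C2 vs Model MC1
    ("macrocyclic", "SVM"): (0.55, 0.67),   # Model D4 vs Model MD2
}

# Gain in test-set r2 obtained by training on the sub-dataset alone.
gains = {key: round(sub - whole, 2) for key, (whole, sub) in test_r2.items()}

for (subset, algorithm), gain in gains.items():
    print(f"{subset:12s} {algorithm}: +{gain:.2f}")
```

In every case the gain is positive, and it is largest for the macrocyclic inhibitors, where the whole-dataset models perform worst.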

Main project members

Zijian Qin (秦子健)

PhD student

zijianqin@foxmail.com